DS 6030 | Fall 2024 | University of Virginia

Homework #10: Density Estimation

Author

SOLUTIONS

Published

Fall 2024

Warning

These solutions are only for the private use of the students of DS 6030 Fall 2024. Sharing of the solutions, posting on the internet, selling to a company, or possession by anyone outside of the class constitutes a violation of the honor policy.

Required R packages and Directories

data_dir = 'https://mdporter.github.io/teaching/data' # data directory
library(ks)        # functions for KDE
library(tidyverse) # functions for data manipulation   

Problem 1 Geographic Profiling

Geographic profiling, a method developed in criminology, can be used to estimate the home location (roost) of animals based on a collection of sightings. The approach requires an estimate of the distribution of distances the animal travels from its roost to forage for food.

A sample of \(283\) distances that pipistrelle bats traveled (in meters) from their roost can be found at:

One probability model for the distance these bats will travel is: \[\begin{align*} f(x; \theta) = \frac{x}{\theta} \exp \left( - \frac{x^2}{2 \theta} \right) \end{align*}\] where the parameter \(\theta > 0\) controls how far they are willing to travel.

a. Derive a closed-form expression for the MLE for \(\theta\) (i.e., show the math).

Solution

This is the Rayleigh distribution, named after Lord Rayleigh.
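As an aside, the Rayleigh form is easy to simulate from, which gives a quick sanity check of the model: the CDF is \(F(x) = 1 - e^{-x^2/(2\theta)}\), so inverse-transform sampling gives \(X = \sqrt{-2\theta \log U}\) with \(U \sim \text{Unif}(0,1)\). A minimal sketch (the seed and `theta_true` value are illustrative only):

```r
# Sanity check (illustrative): simulate Rayleigh draws by inverse-transform
# sampling, F(x) = 1 - exp(-x^2/(2*theta))  =>  X = sqrt(-2*theta*log(U))
set.seed(1)
theta_true = 4
u = runif(1e5)
x_sim = sqrt(-2 * theta_true * log(u))
mean(x_sim^2) / 2   # E[X^2] = 2*theta, so this should be close to theta_true
```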

The log-likelihood is: \[\begin{align*} \log L(\theta) = \sum_{i=1}^n \log X_i - n \log \theta - \frac{1}{2\theta} \sum_{i=1}^n X_i^2 \end{align*}\]

Taking the derivative with respect to \(\theta\) gives:

\[\begin{align*} \frac{d \log L}{d \theta} = -\frac{n}{\theta} + \frac{1}{2\theta^2} \sum_{i=1}^n X_i^2 \end{align*}\]

Setting this equal to zero:

\[\begin{align*} \frac{n}{\theta} &= \frac{1}{2\theta^2} \sum_{i=1}^n X_i^2 \\ 2n \theta &= \sum_{i=1}^n X_i^2 \\ \hat{\theta} &= \frac{1}{2n} \sum_{i=1}^n X_i^2 \end{align*}\]

The second derivative, \(\frac{d^2 \log L}{d \theta^2} = \frac{n}{\theta^2} - \frac{1}{\theta^3} \sum_{i=1}^n X_i^2\), equals \(-n/\hat{\theta}^2 < 0\) at \(\hat{\theta}\), confirming this is a maximum.

b. Estimate \(\theta\) for the bat data using MLE.

Calculate using the solution to part a, or use computational methods.

Solution
#: Read in data
data = readr::read_csv(file.path(data_dir, 'geo_profile.csv'),
                    col_names = FALSE)   
x = data$X1  # convert to vector

#: Calculate MLE
n = length(x)
theta = sum(x^2)/(2*n)  

The MLE is \(\hat{\theta} = 4.52\).
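The closed-form answer can also be verified by maximizing the log-likelihood numerically. A sketch using simulated Rayleigh data for self-containment (with the bat data, replace `x_sim` by `x`; `theta_true` and the seed are illustrative):

```r
# Check the closed-form MLE against direct numerical maximization
set.seed(42)
theta_true = 4.5
x_sim = sqrt(-2 * theta_true * log(runif(5000)))   # Rayleigh draws
loglik <- function(theta, x) sum(log(x)) - length(x)*log(theta) - sum(x^2)/(2*theta)
opt = optimize(loglik, interval = c(0.01, 100), x = x_sim, maximum = TRUE)
c(numerical = opt$maximum, closed_form = sum(x_sim^2) / (2 * length(x_sim)))
```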

c. Plot the estimated density

Using the MLE value of \(\theta\) from part b, calculate the estimated density at a set of evaluation points between 0 and 8 meters. Plot the estimated density.

  • The x-axis should be distance and y-axis should be density (pdf).
Solution
#: Function to estimate Rayleigh density
drayleigh <- function(x, theta) (x/theta) * exp(-x^2/(2*theta))
                      
#: Calculate density on grid using MLE parameter value
theta_hat = sum(x^2)/(2*n) 
plot_data = tibble(eval.pts = seq(0, 8, length=200),
                   density = drayleigh(eval.pts, theta=theta_hat))

#: Plot using ggplot2
ggplot(plot_data, aes(eval.pts, density)) + 
  geom_line() +                            # plot with density
  annotate("rug", x=x) +                   # add rug
  labs(x='distance (meters)', y='density') # change labels

d. Estimate the density using KDE.

Report the bandwidth you selected and produce a plot of the estimated density.

Solution

There is no single right answer for the bandwidth, especially because you are not told how the density estimate will be used; even a visual justification is sufficient.

The ks package gives a few options for 1D bandwidth selection (hpi, hlscv, hucv) and any of these would be sufficient as well. As stated in the ks manual, normal kernels are used and the bandwidth corresponds to the standard deviation of the kernel.

Note: The density estimate should have non-negative support since you can’t observe negative distances. The kde() function has an argument positive=TRUE which gives the necessary edge correction (in this case, a log transform); however, there is no bandwidth selector that incorporates this constraint.

library(ks)  # uses normal kernel

#: Choose bandwidth with least squares cross-validation
(bw = hlscv(x))
[1] 0.35
#: Estimate the density
dens_kde = kde(x, h=bw)
dens_pos = kde(x, h=bw, positive = TRUE)  # log transform 
#: Plot using ggplot2
tibble(x) %>% 
  ggplot(aes(x)) + 
  geom_rug() + 
  geom_histogram(aes(y=after_stat(density)), alpha=.6) + # add histogram
  geom_density(bw=bw, color="red") +               # add basic kde
  geom_line(data = tibble(x=dens_pos$eval.points, y=dens_pos$estimate), 
            aes(x, y), color="purple")             # add edge-corrected density  

#: Plot the results using base R:
plot(dens_kde, las=1)
lines(dens_pos$eval.points, dens_pos$estimate, las=1, col="purple") # edge correction 
rug(x)

e. Which model do you prefer, the parametric or KDE? Why?

Solution

Here again, there is no right answer. Some advantages of the parametric include:

  • Theoretical support for Rayleigh distribution.
  • Simple, one parameter model is easy to explain and doesn’t require much data to fit.
  • Has correct non-negative support

However, given sufficient data and a suitable bandwidth, the KDE method will perform almost as well, even if the data really come from a Rayleigh distribution. One drawback is that the KDE method, unless corrected, can assign density to distances less than 0. The more events there are close to the boundary, the worse this problem will be.

The plots below will allow a visual comparison.

library(tidyverse)

#: make data for plotting
hist_data = tibble(x)
dens_data = tibble(x_seq = seq(0, 8, length=200), 
                   rayleigh = drayleigh(x_seq, theta_hat), 
                   kde = kde(x, h=bw, eval.points=x_seq)$estimate, 
                   kde.pos = kde(x, h=bw, eval.points=x_seq, positive=TRUE)$estimate)
#: Method 1: ggplot2
ggplot() + 
  geom_histogram(data=hist_data, aes(x, after_stat(density)), 
                 binwidth=.25, 
                 boundary = 0,
                 alpha = .5) + 
  geom_rug(data=hist_data, aes(x)) + 
  geom_line(data=dens_data, aes(x_seq, rayleigh, color = "Rayleigh")) + 
  geom_line(data=dens_data, aes(x_seq, kde, color = "KDE")) + 
  geom_line(data=dens_data, aes(x_seq, kde.pos, color = "KDE-positive")) +   
  labs(x = 'distance (meters)', color = "model")

#: Method 2: base R
hist(x, breaks = seq(0, 8, by=.25), freq=FALSE, las=1, 
     main="Density estimation", xlab='distance (meters)', ylab='density')
lines(dens_data$x_seq, dens_data$rayleigh, col="blue")
lines(dens_data$x_seq, dens_data$kde, col="red")
lines(dens_data$x_seq, dens_data$kde.pos, col="purple")

A more rigorous test, in this situation, would be to compare their predictive performance using cross-validation or a hold-out set of events.

#: Monte Carlo cross-validation

# Using negative log density as the loss function
loss <- function(f.x) -sum(log(f.x))

M = 100 
hout = 10
Loss = tibble()
set.seed(2021)
for(m in 1:M){
  #: sample data
  ind_eval = sample(n, hout)
  x_eval = x[ind_eval]
  x_fit = x[-ind_eval]
  
  #: fit models
  theta_hat = sum(x_fit^2)/(2*length(x_fit)) # MLE for Rayleigh distribution
  bw = ks::hlscv(x_fit)                      # bandwidth for KDE model
  
  #: estimate density on hold-out observations
  f_rayleigh = drayleigh(x_eval, theta_hat)
  f_kde = kde(x_fit, h = bw, eval.points=x_eval)$estimate
  f_kde.pos = kde(x_fit, h = bw, eval.points=x_eval, positive=TRUE)$estimate
  
  #: evaluate predictions
  Loss = bind_rows(
    Loss,
    tibble(
      rayleigh = loss(f_rayleigh), 
      kde = loss(f_kde), 
      kde_pos = loss(f_kde.pos),
      iter = m
    )
  )
}

# average loss
Loss %>% summarize(across(-iter, mean))
# A tibble: 1 × 3
  rayleigh   kde kde_pos
     <dbl> <dbl>   <dbl>
1     17.0  17.2    17.4

It turns out that the Rayleigh distribution had the lowest loss (i.e., the largest out-of-sample likelihood).
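Because every iteration scores all three models on the same hold-out set, the per-iteration loss differences are more informative than the averages alone. A sketch of the paired comparison, here on simulated stand-in losses shaped like the `Loss` tibble above (with the actual results, replace `Loss_demo` by `Loss`; the shared `base` term mimics hold-out sets of varying difficulty):

```r
library(dplyr)
# Paired comparison of per-iteration losses (KDE minus Rayleigh),
# using simulated stand-in data for illustration
set.seed(9)
base = rnorm(100, mean = 17, sd = 2)
Loss_demo = tibble(iter = 1:100,
                   rayleigh = base,
                   kde = base + rnorm(100, mean = 0.2, sd = 0.3))
res = Loss_demo %>% 
  mutate(diff = kde - rayleigh) %>%      # positive diff = KDE worse
  summarize(mean_diff = mean(diff),
            se = sd(diff) / sqrt(n()),
            frac_kde_worse = mean(diff > 0))
res
```

The standard error of the paired differences indicates whether the gap between models is larger than the Monte Carlo noise.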

Problem 2: Interstate Crash Density

Interstate 64 (I-64) is a major east-west road that passes just south of Charlottesville. Where and when are the most dangerous places/times to be on I-64? The crash data (link below) gives the mile marker and fractional time-of-week for crashes that occurred on I-64 between mile marker 87 and 136 in 2016. The time-of-week data takes a numeric value of <dow>.<hour/24>, where the dow starts at 0 for Sunday (6 for Sat) and the decimal gives the time of day information. Thus time=0.0417 corresponds to Sun at 1am and time=6.5 corresponds to Sat at noon.
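The time-of-week encoding can be decoded with a small helper (the function name `decode_tow` is ours, for illustration):

```r
# Decode <dow>.<hour/24> into a readable day/hour label
decode_tow <- function(time) {
  days = c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")
  dow = floor(time)                    # integer part: 0 = Sunday, ..., 6 = Saturday
  hour = round((time - dow) * 24, 1)   # fractional part scaled to hour of day
  paste0(days[dow + 1], " ", hour, "h")
}
decode_tow(c(0.0417, 6.5))   # "Sun 1h"  "Sat 12h"
```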

a. Crash Data

Extract the crashes and make a scatter plot with mile marker on x-axis and time on y-axis.

Solution
X = read_csv(file.path(data_dir, "crashes16.csv"))
ggplot(X, aes(mile, time)) + geom_point()

b. Use KDE to estimate the mile marker density.

  • Report the bandwidth.
  • Plot the density estimate.
Solution
library(ks)
#: Use smoothed cross-validation bandwidth
(bw.mile = hscv(X$mile))
[1] 2.92
kde.mile = kde(X$mile, h=bw.mile)
#: Plot
ggplot(X, aes(mile)) + 
  geom_density(bw=bw.mile) + 
  geom_rug() + 
  xlim(c(80, 145))

plot(kde.mile, las=1); rug(X$mile)  # Base R plot

A mile marker bandwidth of 2.92 corresponds to the standard deviation of a Gaussian kernel. Thus, the density at mile marker \(110\) is practically affected only by events within about three bandwidths, or 8.75 miles. So only events between mile markers 101.25 and 118.75 have a practical impact on the density estimate at mile marker \(110\).

c. Use KDE to estimate the temporal time-of-week density.

  • Report the bandwidth.
  • Plot the density estimate.
Solution
library(ks)
#: Use smoothed cross-validation bandwidth
(bw.time = hscv(X$time))
[1] 0.412
kde.time = kde(X$time, h=bw.time)
#: Plot
ggplot(X, aes(time)) + 
  geom_density(bw=bw.time) + 
  geom_rug() + 
  xlim(c(-1,8))

plot(kde.time, las=1); rug(X$time) # Base R plot

A temporal (time-of-week) bandwidth of 0.41 corresponds to the standard deviation of a Gaussian kernel. Thus, the density on Tuesday at 6am (time=2.25) is practically affected only by events within about three bandwidths, or 1.23 days. So only events occurring between times 1.02 and 3.48 have a practical impact on the density estimate at that time.

Note: The time axis is better represented as circular instead of linear such that events occurring near the boundaries of 0 and 7 should have more influence on each other. If you are interested in this idea check out the term Directional Statistics.
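One simple approximation to a circular density (a sketch, not required here) is to replicate the data one period below and one above, run the KDE on the augmented sample, evaluate it on \([0, 7]\), and multiply by 3 to renormalize. The sketch below uses a plain-R Gaussian kernel and simulated times for self-containment; with the real data, substitute `X$time` for `tw` and the `hscv` bandwidth for `bw`:

```r
# Approximate a circular KDE on [0, 7) by wrapping the data
set.seed(7)
tw = runif(200, 0, 7)                     # stand-in for X$time
bw = 0.4                                  # stand-in for hscv(X$time)
tw_wrap = c(tw - 7, tw, tw + 7)           # replicate one period each side
grid = seq(0, 7, length.out = 200)
f = sapply(grid, function(g) mean(dnorm(g, mean = tw_wrap, sd = bw))) * 3
# f integrates to ~1 over [0, 7] and (approximately) matches at the endpoints
```

With this construction, events just before time 7 raise the density just after time 0, and vice versa, which a linear KDE cannot do.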

d. Use KDE to estimate the bivariate mile-time density.

  • Report the bandwidth parameters.
  • Plot the bivariate density estimate.
Solution
#: Estimate density. Use smooth cross-validation
(H = Hscv(X))
       [,1]   [,2]
[1,] 25.992 -0.152
[2,] -0.152  0.366
kde.mv = kde(X, H=H)
#: Make plot data
plot_data = expand_grid(
              mile = seq(min(X$mile), max(X$mile), length=50),
              time = seq(min(X$time), max(X$time), length=50)
            ) %>% 
  mutate(density = predict(kde.mv, x=as.matrix(.)))

#: Plot
ggplot(plot_data, aes(mile, time)) + 
  geom_point(data=X, color="grey80") + 
  geom_contour(aes(z=density), color="black") 

# Note: there is a built-in ggplot function named geom_density_2d(), but it
#  currently allows only a separable kernel (off-diagonal elements are 0)
plot(kde.mv, las=1, xlim=c(87, 136), ylim=c(0, 7))
grid(col='lightblue', lty=1)
points(X, pch=19, cex=.5, col="grey80")

The bandwidth provides an understanding of how “local” the density estimate is. We can plot a 95% Gaussian confidence ellipse around point (110, 2.25) to see the kernel shape and size.

library(mixtools)
plot(X)
points(c(110, 2.25), col="blue", pch="X")
mixtools::ellipse(mu=c(110,2.25), sigma=H, alpha = .05, npoints = 250, col="red") 

Notice that the multivariate kernel bandwidth is estimated to be almost separable (i.e., the off-diagonal term is close to 0), producing an axis-aligned ellipse. Also, the multivariate bandwidths are larger than the univariate equivalents; note that the square roots of the diagonals of \(H\) are on the same scale as the univariate bandwidths.

sqrt(diag(H))
[1] 5.098 0.605

e. Crash Hotspot

Based on the estimated density, approximate the most dangerous place and time to drive on this stretch of road. Identify the mile marker and time-of-week pair (within a few miles and hours).

Solution

Here is a colored contour plot with the contour levels set at 5%, 25%, and 50%:

plot(kde.mv, las=1, xlim=c(87, 136), ylim=c(0, 7), 
     cont=c(5, 25, 50), 
     display='filled.contour')
abline(v = seq(86, 140, by=2), col='lightblue')
abline(h = seq(0, 7, by=.25), col='lightblue')
points(X, pch=19, cex=.5, col="grey80")

There appear to be 3 visual ‘hotspots’:

  1. (102, 5.6): Afton mountain on Friday afternoon/evening
  2. (117, 1.7): Charlottesville area on Monday afternoon/evening
  3. (119, 4.6): Charlottesville area on Thursday afternoon/evening
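The visual reading can be checked with a grid search for the density mode. A minimal sketch with a stand-in density function (with the real fit, replace `f` by `function(mile, time) predict(kde.mv, x = cbind(mile, time))`):

```r
# Locate the mode of a bivariate density by grid search.
# f is a stand-in density centered near hotspot 1; swap in predict() on kde.mv
f <- function(mile, time) dnorm(mile, 102, 5) * dnorm(time, 5.6, 0.6)
g = expand.grid(mile = seq(87, 136, length.out = 400),
                time = seq(0, 7, length.out = 400))
g$density = f(g$mile, g$time)
g[which.max(g$density), c("mile", "time")]   # approximately (102, 5.6)
```

A grid search only finds the global mode; the secondary hotspots would still need to be read off the contour plot or found with a local-maxima search.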